Red wine exploration by Thuy Quach

Abstract

Why do some red wines taste better than others? Is it only because the wine tasters say so, or is there another way to tell? Can we tell what makes a great or a bad wine from its chemical properties? And if so, under what conditions is the quality of red wine best?

This is what we are going to explore: the relationship between chemical properties and wine quality.

The analysis includes: data structure, statistical summary, distribution plots, box plots of each variable vs. quality, a correlation matrix and scatter plots, final plots exploring the strongly correlated variables, and reflections.

Dataset

The data set used in this analysis can be found here: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

Summary of the data

First, let’s see how many samples the wine data contains:

## [1] 1599

There are 1599 samples.

Then, let’s explore all the variables.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

X is the data entry number and quality is the output variable, so 11 of the 13 columns are chemical input variables. The data is in wide format.

What about the structure of the data?

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Quality is stored as an integer. All the chemical variables are numeric.

The statistical summary of the data is shown below.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Quality ranged from 3 to 8. Residual.sugar, chlorides, free.sulfur.dioxide and total.sulfur.dioxide had very wide ranges, with maxima far above their third quartiles. Do these variables influence wine quality?

Univariate Analysis

Distribution of individual variables by histogram and density:

First, let us explore the distribution of each variable using ggplot.

The data is in wide format, which makes it difficult to draw plots of multiple variables in R. Therefore, I reshaped the data into long format.

# reshape data into long format
long_data <- melt(redwine, id.vars=c("X", "quality")) 
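To make the reshaping concrete, here is a minimal sketch in base R on the first three rows shown in the structure output above, restricted to two chemical columns. It only mimics the shape that `melt()` produces (one row per sample and variable). The names `wide`, `long`, `variable` and `value` are illustrative assumptions, not taken from the original analysis.

```r
# First three samples from the str() output, two chemical columns only
wide <- data.frame(
  X = 1:3,
  quality = c(5L, 5L, 5L),
  fixed.acidity = c(7.4, 7.8, 7.8),
  volatile.acidity = c(0.70, 0.88, 0.76)
)

measure_vars <- c("fixed.acidity", "volatile.acidity")

# stack() puts all measurements into one column ("values") and records
# the source column of each measurement in a factor ("ind")
stacked <- stack(wide[measure_vars])

# Repeat the id columns once per measured variable and bind them on
long <- data.frame(
  X = rep(wide$X, times = length(measure_vars)),
  quality = rep(wide$quality, times = length(measure_vars)),
  variable = stacked$ind,
  value = stacked$values
)

print(long)
```

In long format each chemical property becomes a level of `variable`, which is what lets a single plotting call draw one panel per property.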

All variables

Some of the variables appear to follow a roughly normal distribution, such as density, pH, alcohol, volatile.acidity and fixed.acidity. A few others were right-skewed, such as residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide, sulphates and chlorides.

Quality

Most of the wine samples had a quality of 5 or 6. Let’s get the exact number.

# calculate the % of wine with quality 5 and 6
100*count(subset(redwine, quality == 5 | quality == 6))/length(redwine$quality)
##          n
## 1 82.48906

82.49% of the wines had a quality of 5 or 6.
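The same kind of percentage can also be sketched without `subset()` and `count()`, because the mean of a logical vector is the share of TRUE values. The `quality` vector below is toy data made up for illustration, not the wine data.

```r
# Toy quality scores (NOT the real data), used only to show the calculation
quality <- c(5, 6, 5, 7, 4, 6, 5, 8, 6, 5)

# %in% gives TRUE for scores of 5 or 6; mean() of TRUE/FALSE is the proportion
pct_5_or_6 <- 100 * mean(quality %in% c(5, 6))

print(pct_5_or_6)
```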

Data correlations

Let us compute a correlation matrix with ggpairs to see which chemical properties have strong relationships with wine quality and with each other. It was difficult to plot ggpairs on all variables at once because the space allotted to the plot couldn’t hold a 12 x 12 grid of panels, so I created three groups and made sure that the variable “quality” (column 13) was present in each.

As a rule of thumb, any correlation with an absolute value above 0.3 is meaningful and one above 0.7 is pretty strong. Let us see if we can find any in the results below.

The correlation coefficient between quality and volatile.acidity was -0.391, between citric.acid and fixed.acidity 0.672, and between citric.acid and volatile.acidity -0.552.

The correlation coefficient between total.sulfur.dioxide and free.sulfur.dioxide was 0.668.

The correlation coefficient between quality and alcohol was 0.476, and between pH and density -0.342.

Bivariate analysis plots

Which chemical properties are correlated with each other?

From the previous correlation analysis, we found that some chemical properties were correlated with each other. Let’s explore them with scatter plots and linear regression lines.

Citric.acid and fixed.acidity

Statistical summary of citric.acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Statistical summary of fixed.acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

As citric.acid increased, fixed.acidity increased. Citric.acid ranged from 0 to 1 g/dm^3 while fixed.acidity ranged from 4.6 to 15.9 g/dm^3. This is plausible, since citric.acid is an acid that contributes to the fixed acidity of the wine. The earlier correlation analysis supports this result: the correlation coefficient of the two chemical properties was 0.672.

Total.sulfur.dioxide and free.sulfur.dioxide

Statistical summary of total.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Statistical summary of free.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

As total.sulfur.dioxide increased, free.sulfur.dioxide increased. Total.sulfur.dioxide ranged from 6 to 289 mg/dm^3 while free.sulfur.dioxide ranged from 1 to 72 mg/dm^3. This is plausible, since free.sulfur.dioxide is a part of total.sulfur.dioxide. The earlier correlation analysis supports this result: the correlation coefficient of the two chemicals was 0.668.

pH and density:

Statistical summary of pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Statistical summary of density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Though the correlation was not strong, we could notice that as pH increased, density tended to decrease. pH ranged from 2.74 to 4.01 while density ranged from 0.990 to 1.004. The range of density was very small (around 0.014). The earlier correlation analysis supports this result: the correlation coefficient of the two chemical properties was -0.342.

What chemical properties influence wine quality?

From the above correlation analysis, I found that only alcohol and volatile.acidity had correlation coefficients with quality larger than 0.3 in absolute value. Since we are interested in what makes the best wine, it is important to also consider other chemical properties that may have some impact.

Let’s see the results below.

##                             [,1]
## fixed.acidity         0.12405165
## volatile.acidity     -0.39055778
## citric.acid           0.22637251
## residual.sugar        0.01373164
## chlorides            -0.12890656
## free.sulfur.dioxide  -0.05065606
## total.sulfur.dioxide -0.18510029
## density              -0.17491923
## pH                   -0.05773139
## sulphates             0.25139708
## alcohol               0.47616632

We could see that 6 chemical properties (volatile.acidity, total.sulfur.dioxide, pH, free.sulfur.dioxide, density, chlorides) have a negative correlation with quality, which suggests that higher levels of them go with worse-tasting wine. Among those properties, volatile.acidity had the strongest association, with a correlation of -0.391. Sulphates, residual.sugar, fixed.acidity, citric.acid and alcohol, in contrast, were positively correlated with quality. Among those, alcohol, sulphates and citric.acid had the strongest associations, with correlations of 0.476, 0.251 and 0.226 respectively.
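As a small sketch of how the table above can be read systematically, the correlations can be sorted by absolute value and compared with the 0.3 rule of thumb used earlier. The values are copied from the table. The names `r_quality`, `ranked` and `meaningful` are illustrative, and the comment about `cor()` is an assumption about how such a table could be produced, not the original code.

```r
# Correlations with quality, copied from the table above.
# On the full data, a call such as cor(redwine[, 2:12], redwine$quality)
# would be one way to obtain a table like this (assumption, not shown here).
r_quality <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Rank by strength of association, ignoring the sign
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]

# Keep only the properties that pass the 0.3 rule of thumb
meaningful <- names(ranked)[abs(ranked) > 0.3]

print(round(ranked, 3))
print(meaningful)
```

Ranking by absolute value keeps negative and positive associations on the same scale, so volatile.acidity is not overlooked just because its sign is negative.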

Correlation of chemical properties vs. wine quality by boxplots:

From the box plots, it looked like alcohol, sulphates, volatile.acidity and citric.acid might have an impact on the quality of wines. The results were consistent with the previous correlation analysis.

Let’s zoom in on the plots of these chemical properties.

Alcohol and quality

As the wine quality increased from 3 to 8, the average alcohol increased, except for quality 5. We could also see that wines of quality 5 had many outliers.

Let’s compare the distributions of alcohol for different wine qualities.

The distributions of alcohol were similar and almost normal for all wine qualities except 5, where the distribution was much narrower.

Let’s see the summary of alcohol for quality 5.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9

The mean alcohol for quality 5 was 9.90.

Let’s compare with the other qualities.

quality_vs_alcohol <- redwine %>%
  group_by(quality) %>%
  summarize(avg_alcohol = mean(alcohol)) %>%
  arrange(avg_alcohol)

quality_vs_alcohol
## Source: local data frame [6 x 2]
## 
##   quality avg_alcohol
##     (int)       (dbl)
## 1       5    9.899706
## 2       3    9.955000
## 3       4   10.265094
## 4       6   10.629519
## 5       7   11.465913
## 6       8   12.094444

The average alcohol increased from 9.955 to 12.094 (about 1.2 times) as wine quality increased from 3 to 8, except for quality 5, where the average alcohol was 9.900.
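The ratio and the exception at quality 5 can be checked directly from the averages in the table above. This is a minimal sketch in base R. The object names (`avg`, `by_quality`, `ratio`, `dips`) are illustrative and not part of the original analysis.

```r
# Per-quality average alcohol, copied from the table above
avg <- data.frame(
  quality = c(5, 3, 4, 6, 7, 8),
  avg_alcohol = c(9.899706, 9.955000, 10.265094,
                  10.629519, 11.465913, 12.094444)
)

# Sort by quality (the table above is sorted by the average instead)
by_quality <- avg[order(avg$quality), ]

# Ratio of the average at quality 8 to the average at quality 3
ratio <- by_quality$avg_alcohol[by_quality$quality == 8] /
  by_quality$avg_alcohol[by_quality$quality == 3]

# Quality levels whose average is lower than that of the level just below
steps <- diff(by_quality$avg_alcohol)
dips <- by_quality$quality[-1][steps < 0]

print(round(ratio, 2))
print(dips)
```

Sorting by quality rather than by the average makes the one break in the upward trend easy to spot.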

Citric.acid and quality

As the wine quality increased from 3 to 8, the average citric.acid increased.

Let’s compare the distributions of citric.acid for different wine qualities.

We could see the mean of citric.acid shift to the right as wine quality increased.

Let’s summarize and arrange the mean of citric.acid.

quality_vs_citric.acid <- redwine %>%
  group_by(quality) %>%
  summarize(avg_citric.acid = mean(citric.acid)) %>%
  arrange(avg_citric.acid)

quality_vs_citric.acid
## Source: local data frame [6 x 2]
## 
##   quality avg_citric.acid
##     (int)           (dbl)
## 1       3       0.1710000
## 2       4       0.1741509
## 3       5       0.2436858
## 4       6       0.2738245
## 5       7       0.3751759
## 6       8       0.3911111

It was clear that the average value of citric.acid increased from 0.171 to 0.391 (2.3 times) as quality increased from 3 to 8.

Sulphates and quality

As the wine quality increased from 3 to 8, the average sulphates increased.

Let’s compare the distributions of sulphates for different wine qualities.

We could see that the distributions of sulphates were similar and that the mean of sulphates shifted to the right as wine quality increased.

Let’s summarize and arrange the mean of sulphates.

quality_vs_sulphates <- redwine %>%
  group_by(quality) %>%
  summarize(avg_sulphates = mean(sulphates)) %>%
  arrange(avg_sulphates)

quality_vs_sulphates
## Source: local data frame [6 x 2]
## 
##   quality avg_sulphates
##     (int)         (dbl)
## 1       3     0.5700000
## 2       4     0.5964151
## 3       5     0.6209692
## 4       6     0.6753292
## 5       7     0.7412563
## 6       8     0.7677778

It was clear that the average value of sulphates increased from 0.570 to 0.768 (1.3 times) as quality increased from 3 to 8.

Volatile.acidity and quality

As the wine quality increased from 3 to 8, there was a decrease in volatile.acidity.

Let’s compare the distributions of volatile.acidity for different wine qualities.

We could see that the distributions of volatile.acidity were similar and that the mean of volatile.acidity shifted to the left as wine quality increased.

Let’s summarize and arrange the mean of volatile.acidity.

quality_vs_volatile.acidity <- redwine %>%
  group_by(quality) %>%
  summarize(avg_volatile.acidity = mean(volatile.acidity)) %>%
  arrange(avg_volatile.acidity)

quality_vs_volatile.acidity
## Source: local data frame [6 x 2]
## 
##   quality avg_volatile.acidity
##     (int)                (dbl)
## 1       7            0.4039196
## 2       8            0.4233333
## 3       6            0.4974843
## 4       5            0.5770411
## 5       4            0.6939623
## 6       3            0.8845000

It was clear that the average value of volatile.acidity decreased from 0.88 at quality 3 to 0.42 at quality 8 (about 2.1 times lower), with the lowest average, 0.40, at quality 7.

Summary of bivariate analysis:

There were notable correlations among the chemical properties, such as citric.acid with fixed.acidity (0.672), citric.acid with volatile.acidity (-0.552), total.sulfur.dioxide with free.sulfur.dioxide (0.668), and, more weakly, pH with density (-0.342).

Some chemical properties were also correlated with quality: volatile.acidity (-0.391) and alcohol (0.476) moderately, and sulphates (0.251) and citric.acid (0.226) more weakly.

Multivariate Plots Section

It is important to look at several variables at once. In the previous bivariate analysis, we found that some chemical properties correlated well with each other or with quality. In this section, we analyze how our feature of interest, quality, varies with the other chemical properties.

To simplify and see the relationships more clearly, I averaged the chemical properties within each quality level and added a new rating variable, which groups the quality levels into three groups.

Average of all variables grouped by quality

## Source: local data frame [6 x 12]
## 
##   quality avg_alcohol avg_citric.acid avg_sulphates avg_volatile.acidity
##     (int)       (dbl)           (dbl)         (dbl)                (dbl)
## 1       5    9.899706       0.2436858     0.6209692            0.5770411
## 2       3    9.955000       0.1710000     0.5700000            0.8845000
## 3       4   10.265094       0.1741509     0.5964151            0.6939623
## 4       6   10.629519       0.2738245     0.6753292            0.4974843
## 5       7   11.465913       0.3751759     0.7412563            0.4039196
## 6       8   12.094444       0.3911111     0.7677778            0.4233333
## Variables not shown: avg_fixed.acidity (dbl), avg_pH (dbl),
##   avg_residual.sugar (dbl), avg_density (dbl), avg_total.sulfur.dioxide
##   (dbl), avg_free.sulfur.dioxide (dbl), avg_chlorides (dbl)

The table above shows the average value of each chemical property for every wine quality.

Let’s see how the variables vary with quality and with each other.

# reshape data into long format
long_data_avg <- melt(quality_vs_total_variables, id.vars=c("quality")) 

Group quality into three groups using a new variable, rating

# turn the data into a data.table
wine_table <- data.table(redwine)

# add new rating variable
wine_table[, rating := ifelse(quality <=4, "bad",
                       ifelse(quality >=5 & quality <=6, "good",
                       ifelse(quality >=7, "very good", NA)))]

Let’s summarize the wine by rating.

wine_table %>%
  group_by(rating) %>%
  summarize(n_obs = n())
## Source: local data table [3 x 2]
## 
##      rating n_obs
##       (chr) (int)
## 1      good  1319
## 2 very good   217
## 3       bad    63

So, there were 217 very good wines, 1319 good wines and 63 bad wines.
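As an alternative sketch, the same three-way grouping can be written with base R’s `cut()`, which avoids the nested `ifelse()` calls and returns a factor whose levels are already in order. The break points mirror the rule above (3-4 bad, 5-6 good, 7-8 very good). The `quality` vector here is just the six possible scores, not the wine data.

```r
# The six quality scores that occur in the data
quality <- 3:8

# Intervals are closed on the right by default:
# (-Inf, 4] -> "bad", (4, 6] -> "good", (6, Inf) -> "very good"
rating <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("bad", "good", "very good")
)

print(data.frame(quality, rating))
```

Because `rating` is a factor with levels ordered bad, good, very good, summaries and plots grouped by it come out in that order rather than alphabetically or by first appearance.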

Average of all variables grouped by rating

## Source: local data table [3 x 12]
## 
##      rating avg_alcohol avg_citric.acid avg_sulphates avg_volatile.acidity
##       (chr)       (dbl)           (dbl)         (dbl)                (dbl)
## 1       bad    10.21587       0.1736508     0.5922222            0.7242063
## 2      good    10.25272       0.2582638     0.6472631            0.5385595
## 3 very good    11.51805       0.3764977     0.7434562            0.4055300
## Variables not shown: avg_fixed.acidity (dbl), avg_pH (dbl),
##   avg_residual.sugar (dbl), avg_density (dbl), avg_total.sulfur.dioxide
##   (dbl), avg_free.sulfur.dioxide (dbl), avg_chlorides (dbl)

The table shows the average value of each chemical property for each wine rating.

# reshape data into long format
long_data_avg_rating <- melt(rating_vs_total_variables, id.vars=c("rating")) 

Citric.acid and fixed.acidity correlation coded by quality

## Source: local data frame [6 x 3]
## 
##   avg_fixed.acidity avg_citric.acid quality
##               (dbl)           (dbl)   (int)
## 1          8.566667       0.3911111       8
## 2          8.872362       0.3751759       7
## 3          8.347179       0.2738245       6
## 4          8.167254       0.2436858       5
## 5          7.779245       0.1741509       4
## 6          8.360000       0.1710000       3

We could see a trend that the higher the wine quality, the higher both avg_fixed.acidity and avg_citric.acid tended to be. From quality 4 to quality 8, average fixed.acidity rose from 7.78 to 8.57 (peaking at 8.87 for quality 7) and average citric.acid rose from 0.17 to 0.39. Quality 3 was an exception for fixed.acidity, with an average of 8.36. This is consistent with fixed.acidity and citric.acid being correlated with each other (correlation coefficient 0.672) and with quality (correlations of 0.124 and 0.226 respectively).

Total.sulfur.dioxide and free.sulfur.dioxide coded by quality

We could see that free.sulfur.dioxide and total.sulfur.dioxide were correlated with each other but not with quality. It was interesting to note that the wine quality was best in the middle range of both chemical properties (around 14 and 35 mg/dm^3 respectively).

## Source: local data frame [6 x 3]
## 
##   avg_free.sulfur.dioxide avg_total.sulfur.dioxide quality
##                     (dbl)                    (dbl)   (int)
## 1                13.27778                 33.44444       8
## 2                14.04523                 35.02010       7
## 3                15.71160                 40.86991       6
## 4                16.98385                 56.51395       5
## 5                12.26415                 36.24528       4
## 6                11.00000                 24.90000       3

It was also noted that the average total.sulfur.dioxide was similar in bad wine (quality 4) and very good wine (quality 7), while the average free.sulfur.dioxide was around 2 mg/dm^3 higher in very good wine. When the concentrations of both chemicals increased further, the wine quality was lower. This suggests that low concentrations of these chemicals go with bad-tasting wine, but that too much of them (above 35 mg/dm^3 for total.sulfur.dioxide and 14 mg/dm^3 for free.sulfur.dioxide) goes with reduced quality.

pH and density coded by quality:

pH and density were slightly correlated with each other and only weakly with quality. Lower values of both pH and density went with higher quality. Higher pH seemed to reduce quality, while the effect of density was not clear.

## Source: local data frame [6 x 3]
## 
##     avg_pH avg_density quality
##      (dbl)       (dbl)   (int)
## 1 3.267222   0.9952122       8
## 2 3.290754   0.9961043       7
## 3 3.318072   0.9966151       6
## 4 3.304949   0.9971036       5
## 5 3.381509   0.9965425       4
## 6 3.398000   0.9974640       3

We could see that the average density changed only from 0.997 to 0.995 g/cm^3, a very small range, so density appears to have very little impact on quality. The average pH changed from 3.398 to 3.267 as wine quality increased from 3 to 8, so pH and quality have a negative, though weak, correlation.

Alcohol and sulphates coded by quality:

Across the quality levels, the averages of sulphates and alcohol were strongly correlated with each other. As quality increased from 3 to 8, average sulphates increased from 0.57 to 0.77 and average alcohol from 9.96 to 12.09.

## Source: local data frame [6 x 3]
## 
##   avg_sulphates avg_alcohol quality
##           (dbl)       (dbl)   (int)
## 1     0.7677778   12.094444       8
## 2     0.7412563   11.465913       7
## 3     0.6753292   10.629519       6
## 4     0.6209692    9.899706       5
## 5     0.5964151   10.265094       4
## 6     0.5700000    9.955000       3

Volatile.acidity and total.sulfur.dioxide coded by quality:

## Source: local data frame [6 x 3]
## 
##   avg_total.sulfur.dioxide avg_volatile.acidity quality
##                      (dbl)                (dbl)   (int)
## 1                 33.44444            0.4233333       8
## 2                 35.02010            0.4039196       7
## 3                 40.86991            0.4974843       6
## 4                 56.51395            0.5770411       5
## 5                 36.24528            0.6939623       4
## 6                 24.90000            0.8845000       3

Total.sulfur.dioxide and volatile.acidity were not correlated with each other. The average total.sulfur.dioxide was low (around 25 to 36) for qualities 3-4 and 7-8, while volatile.acidity was negatively correlated with quality: its average decreased from 0.88 to 0.42 as quality increased from 3 to 8.

Multivariate Summary

pH and density were slightly correlated with each other and only weakly with quality. Lower values of both pH and density went with higher quality. Higher pH seemed to reduce quality, while the effect of density was not clear.

Total.sulfur.dioxide and volatile.acidity were not correlated with each other. Total.sulfur.dioxide was not clearly correlated with wine quality, while volatile.acidity was.

Some chemical properties correlated well with each other but not with quality, such as free.sulfur.dioxide and total.sulfur.dioxide. It was interesting to note that the wine quality was best in the middle range of both chemical properties (around 14 and 35 mg/dm^3 respectively).

Some chemical properties were correlated with each other and with wine quality, particularly:

  • Fixed.acidity and citric.acid strongly correlate with each other. From quality 4 to quality 8, average fixed.acidity increased from 7.78 to 8.57 and average citric.acid from 0.17 to 0.39.

  • The averages of sulphates and alcohol rose together. From quality 3 to quality 8, average sulphates increased from 0.57 to 0.77 and average alcohol from 9.96 to 12.09.

Final Plots and Summary

We have explored the red wine data with many interesting questions about the data structure, the data summary, and how the chemical properties vary with each other and with our feature of interest, quality. We did statistical analysis and made many different kinds of plots, such as histograms, box plots and bar graphs. Let’s summarize the findings in three plots.

Plot One: Chemical properties highly influence wine quality

From plot 1, we could see that alcohol, citric.acid, fixed.acidity and sulphates were positively associated with wine quality (green bars). Among those properties, alcohol, sulphates and citric.acid had the strongest associations, with correlations of 0.476, 0.251 and 0.226 respectively.

Volatile.acidity, total.sulfur.dioxide, density and chlorides were negatively associated with wine quality (red bars). Among those properties, volatile.acidity had the strongest association, with a correlation of -0.391.

Plot Two

Having found that alcohol and volatile.acidity have the strongest associations with wine quality, let’s summarize their relationships with it. The plots below were selected and improved from the bivariate plots section.

Statistical summary of how average alcohol and volatile.acidity vary with quality:

## Source: local data frame [6 x 3]
## 
##   quality avg_alcohol avg_volatile.acidity
##     (int)       (dbl)                (dbl)
## 1       3    9.955000            0.8845000
## 2       4   10.265094            0.6939623
## 3       5    9.899706            0.5770411
## 4       6   10.629519            0.4974843
## 5       8   12.094444            0.4233333
## 6       7   11.465913            0.4039196

Average volatile.acidity rose from 0.42 to 0.88 as wine quality fell from 8 to 3, while average alcohol rose from 9.96 to 12.09 as wine quality rose from 3 to 8. These results are consistent with the correlation findings, where volatile.acidity had a correlation coefficient of -0.391 and alcohol 0.476. I suggest using these two chemical properties as the main features for a quality prediction model.

Plot Three

Next, let’s see whether any of the chemical properties were strongly correlated with each other and also with quality. The plots below were selected and improved from the multivariate plots section.

Note that I grouped the wine quality into 3 groups: bad (quality of 3 and 4), good (quality of 5 and 6) and very good (quality of 7 and 8).

Statistical summary of how average alcohol, sulphates, citric.acid and fixed.acidity vary with rating.

##      rating avg_alcohol avg_sulphates avg_citric.acid avg_fixed.acidity
## 1       bad    10.21587     0.5922222       0.1736508          7.871429
## 2      good    10.25272     0.6472631       0.2582638          8.254284
## 3 very good    11.51805     0.7434562       0.3764977          8.847005

We found that:

  • Fixed.acidity and citric.acid strongly correlate with each other. As wine rating went from bad to very good, average fixed.acidity increased from 7.87 to 8.85 and average citric.acid from 0.17 to 0.38.

  • The averages of sulphates and alcohol rose together. As wine rating went from bad to very good, average sulphates increased from 0.59 to 0.74 and average alcohol from 10.22 to 11.52.

It was interesting to note that the four chemical properties were all positively correlated with wine quality, as shown in plot 1: fixed.acidity, sulphates, citric.acid and alcohol had correlations of 0.124, 0.251, 0.226 and 0.476 respectively.

When we build a model for predicting quality, we should select the features carefully so that no two or three of them are too strongly correlated with each other.
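One simple way to act on this is to flag pairs of candidate features whose correlation exceeds a chosen cut-off. The sketch below uses the pairwise correlations reported earlier in this analysis. The 0.5 threshold and the object names (`feature_pairs`, `threshold`, `too_correlated`) are assumptions chosen for illustration, not a rule from the original analysis.

```r
# Pairwise correlations reported in the correlation analysis above
feature_pairs <- data.frame(
  var1 = c("citric.acid", "citric.acid", "total.sulfur.dioxide", "pH"),
  var2 = c("fixed.acidity", "volatile.acidity", "free.sulfur.dioxide", "density"),
  r = c(0.672, -0.552, 0.668, -0.342)
)

# Hypothetical cut-off: treat |r| above 0.5 as "too correlated"
threshold <- 0.5

# Pairs from which a model should probably keep only one member
too_correlated <- feature_pairs[abs(feature_pairs$r) > threshold, ]

print(too_correlated)
```

With this cut-off, citric.acid would not be used together with fixed.acidity or volatile.acidity, and only one of the two sulfur dioxide measures would be kept.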

Reflection